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The adaptive immune response selectively expands B- and T-cell clones following antigen recognition by B- and T-cell receptors 
(BCR and TCR), respectively. Next-generation sequencing is a powerful tool for dissecting the BCR and TCR populations at high 
resolution, but robust computational analyses are required to interpret such sequencing. Here, we develop a novel computa- 
tional approach for BCR repertoire analysis using established next-generation sequencing methods coupled with network 
construction and population analysis. BCR sequences organize into networks based on sequence diversity, with differences in 
network connectivity clearly distinguishing between diverse repertoires of healthy individuals and clonally expanded repertoires 
from individuals with chronic lymphocytic leukemia [CLL] and other clonal blood disorders. Network population measures 
defined by the Gini Index and cluster sizes quantify the BCR clonality status and are robust to sampling and sequencing depths. 
BCR network analysis therefore allows the direct and quantifiable comparison of BCR repertoires between samples and intra- 
individual population changes between temporal or spatially separated samples and over the course of therapy. 



[Supplemental material is available for this article.] 

Healthy humans have —3x10^ B-cells in the peripheral blood, and 
this population consists of a repertoire of distinct B-cells expressing 
different B-cell receptors (BCRs) necessary to bind diverse antigens 
and produce an effective humoral immune response. BCRs consist 
of two identical heavy-chain (IGH) and two identical light-chain 
proteins, where the antigen-binding regions are highly diversified 
(Tonegawa 1983; Woof and Burton 2004). BCR diversity is generated 
in a number of ways. The IGH gene locus encodes for multiple dis- 
tinct copies of the variable (V), diversity (D), and joining (J) gene 
segments (Jung et al. 2006), with functional IGH BCR genes gen- 
erated by site-specific V-D-J recombination (Latchman 2005; Schatz 
and Swanson 2010). The imprecise joining of the V-D-J gene seg- 
ments leads to random deletion and insertion of nucleotides during 
recombination events, resulting in sequence diversification at the 
junctional regions (Fig. lA). Rearranged BCR genes are further di- 
versified by helper T-cell-mediated somatic hypermutation (SHM) 
through the action of activation-induced cytosine deaminase. 
Through clonal affinity selection for enhanced antigen binding, 
non-germ-line SHM-mediated variation contributes significantly to 
the diversification of the mature B-cell repertoire (Brezinschek et al. 
1995; Corner et al. 1998; Weinstein et al. 2009; Batrak et al. 2011). 

The diversification and selection dynamics of BCR repertoires 
in healthy individuals and those with infection, autoimmunity, 
immunodeficiency, or B-cell malignancies remain poorly under- 
stood but can have important clinical implications. For example, 
the majority of B-cells in individuals with B-cell malignancies typ- 
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ically express a single dominant clonal BCR sequence (Arber 2000; 
Campbell et al. 2008), and continued intraclonal tumor evolution by 
SHM in patients with B-cell lymphomas has been observed (Stama- 
topoulos et al. 1996; Bagnara et al. 2006; Volkheimer et al. 2007). 
Importantly, patients with chronic lymphocytic leukemia (CLL) with 
mutated BCR sequences in the tumor clone compared with the germ 
line have a prognostically inferior survival rate and requirement of 
early treatment compared with those with unmutated malignant 
clones (Caligaris-Cappio and Ghia 2008). The BCR sequence reper- 
toire of an individual therefore represents a surrogate of their B- 
cell clonality status in health and disease, with the potential to 
give new insights into the adaptive immune response as well as 
providing diagnostic and prognostic power when used clinically. 

Previous studies have mainly produced descriptive analyses of 
the BCR populations. Isoelectric focusing (lEF) spectrotype methods 
(Williamson et al. 1973; Rieben et al. 1996; Satoh et al. 1996) pre- 
ceded the advent of sequencing technologies (Arnaout et al. 2011) 
and are not quantitative. The potential size of the human repertoire 
is estimated to be 10^^ unique BCR sequences; therefore, deep, high- 
throughput sequencing is necessary for sampling this repertoire 
robustly and to identify different subsets of BCRs (Dimitrov 2010; 
Benichou et al. 2012). There are several methods for isolation, am- 
plification, and sequencing of B-cell repertoires. Multiplex PCR 
amplification, using degenerate PCR primers complementary to 
germ-line V and J segments have been designed and validated pre- 
viously (van Dongen et al. 2003; Lukowsky et al. 2006; Bruggemann 
et al. 2007; Evans et al. 2007; van Krieken et al. 2007; Vargas et al. 
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Figure 1. Sequencing of B-cell receptor repertoires. (A) Representation of the genomic rear- 
rangement process during V-D-J recombination to generate the heavy-chain B-cell receptor. B-cell 
receptor amplification was performed by reverse transcription on total RNA by single J region 
primer, and subsequent multiplex PCR amplification. (B) The percentage of reads corresponding to 
the highest expressed B-cell receptor sequence for each sample, separated into sample type: healthy 
individuals, chronic lymphocytic leukemia patients (CLL), and human lymphoblastoid cell lines 
(LCL). Two-sided f-tests were performed between the sample subsets, with the P-values indicated 
above. (C) Percentage of sequences shared between runs for technical repeats for (1) the RT-PCR 
and resequencing (RT-PCR repeats, green bars), and (2) the 454 sequencing from the same RT-PCR 
product (sequencing repeats, purple bars). For each sample, two repeats were performed and the 
percentage of reads shared between the repeats is shown (each repeat is compared with the other, 
so two bars are shown per sample). 



2008), used in numerous biological studies (Sanchez et al. 2003; 
Campbell et al. 2008; Boyd et al. 2009, 2010; Krause et al. 201 1; Jager 
et al. 2012; Lev et al. 2012; Maletzki et al. 2012), and optimized for 
clinical use (McClure et al. 2006; Harris et al. 2012; Sproul and 
Goodlad 2012), although the potential for biased PCR amplification 
remains. The 5' rapid amplification of cDNA ends (5' RACE) has also 
been used (Bertioli 1997; Freeman et al. 2009; Varadarajan et al. 
2011; Warren et al. 2011), but can suffer from low efficiency and 
high levels of nonspecific amplification, contamination by short 
fragments from RNA degradation, or incomplete cDNA synthesis. 
Both methods utilize PCR and therefore have a risk of systematic 
over/under-representation of immunoglobulin sequences either 
through different primer annealing or different amplification effi- 
ciencies of the distinct V families (Sandberg et al. 2005). 

Previous studies have qualitatively shown diverse IGH rep- 
ertoires in healthy patients contrasting with clonal populations in 
malignancies (Sanchez et al. 2003; Campbell et al. 2008; Boyd et al. 
2009; Carulli et al. 2011; Logan et al. 2011; Maletzki et al. 2012) 
and have also shown that distinct subsets of B-cells within the 
same individual have distinct repertoires (Wu et al. 2010). To date. 



next-generation sequencing (NGS) of 
BCRs have primarily focused on classify- 
ing the IGHV, D, and / recombination 
frequencies to understand the diversity of 
the BCR repertoire (Stewart et al. 1997; 
Sanchez et al. 2003; Campbell et al. 2008; 
Boyd et al. 2009, 2010; Weinstein et al. 
2009; Wu et al. 2010; Jager et al. 2012; Lev 
et al. 2012; Maletzki et al. 2012). How- 
ever, computational assignment of V-D-J 
sequences to reference databases results 
in many incompletely assigned IGHV, D, 
and / genes, even when the germ-line al- 
leles are known (Weinstein et al. 2009). 
This is most likely due to SHM masking 
the identity of the germ-line genes present 
in the NGS, or the existence of allelic vari- 
ation relative to the reference IGH genes. 
Further, investigation of V-D-J gene usage 
frequencies utilizes only part of the BCR 
sequence diversity, with important infor- 
mation about the V-D-J joining regions and 
mutational relationships not considered. 

Here, we propose that analysis of the 
BCR sequence relationships using the full 
BCR V-D-J sequence is more informative 
for human BCR repertoire analysis than 
V-D-J gene classification. We show that 
human BCR repertoire diversity can be 
interpreted through full V-D-J genotype 
diversity using BCR networks, previously 
shown to be an intuitive way for un- 
derstanding B-cell repertoires in zebrafish 
(Ben-Hamo and Efroni 2011). In such 
networks, the lowest level of organization 
in a population of B-cells, namely, inde- 
pendent B-cells, is represented by sparse 
networks, whereas highly developed 
(connected) networks most likely result 
from clonal expansions of B-cells arising 
through antigenic exposure or B-cell ma- 
lignancies (Ben-Hamo and Efroni 2011). 
Using degenerate PCR-based methods we focus on sequenc- 
ing RNA populations to maximize analysis of functionally re- 
arranged BCRs rather than any nonfunctional first BCR allele 
defective rearrangements present in the genomic DNA from B- 
cell populations, but with the disadvantage that unequal num- 
bers of RNA molecules per cell have the potential to inflate or de- 
flate detected B-cell populations in the repertoire. Through se- 
quencing the BCRs from samples with clonally expanded B-cell 
populations (peripheral blood from patients with CLL and human 
lymphoblastoid cell lines [LCLs]) as well as diverse BCR pop- 
ulations from peripheral blood from healthy individuals, we show 
that network analysis provides a robust framework to understand 
vast sequencing repertoires by sequence relationships that clearly 
distinguish between B-cells quantitatively using network mea- 
sures. This framework is complimentary to existing phylogenetic 
methods, and we show, for the first time, B-cell tumor clone evo- 
lution over the course of therapy. These methods are robust to 
sampling and sequencing depths as well as different sequencing 
technologies, thereby allowing the direct comparison of multiple 
tumor samples from the same and different patients. 
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Results 

Next-generation sequencing of IgH variable genes 

We amplified by RT-PCR the expressed rearranged IGHV-D-J loci 
from mRNA from human B-cell populations using the consensus 
IGHJ primer and FRl or FR2 IGHV family primers (Fig. lA; Sup- 
plemental Table SI; van Dongen et al. 2003). Peripheral blood (PB) 
samples from 13 healthy individuals, 11 CLL patients, and eight 
LCLs yielded PGR products of expected sizes (310-360 bp for FRl 
and 250-295 bp for FR2 primed samples) and were 454 sequenced 
(Table 1). Samples yielded an average of 42,324 sequencing reads 
after filtering for quality and presence of IGH sequence (Supple- 
mental Table S2). Two additional samples from CLL patient A (pre- 
and post-treatment) were sequenced on the MiSeq platform. We 
also analyzed the BCR 454 sequence data sets from Boyd et al. 
(2009); which includes three healthy individuals and five patients 
with clonal blood disorders (Supplemental Table S6). 

The combined per-base error-rate for the RT-PCR and se- 
quencing process for the 454 platform was similar to other studies 
(Wang et al. 2007; Boyd et al. 2009) (1.74 X 10"^, of which ho- 
mopolymeric indels and nonhomopolymeric errors accounted 
for 59.7% [1.04 X 10"^] and 40.3% [7.04 X 10"^] of the total enor- 
rate, respectively). Similarly, the combined per-base error-rate for 
MiSeq was 2.06 X 10"^. 

To initially assess the clonality of our samples, we determined 
the percentage of reads identical to the most abundant BCR se- 
quence in each sample. The percentage of reads corresponding 
to the highest expressed BCR sequence in each of the CLL and 
LCL samples (range 39.3%-87.8% and 35.2%-78.7%, respectively) 
were significantly higher than that of PB from healthy individuals 
(range 0.10%-14.0%) with a P-value of <0.001 (Fig. IB). There was 
no significant difference in the percentage of identical reads be- 
tween the LCL and CLL patient samples (P-value = 0.0594). There- 
fore, we confirm that the healthy individuals represent diverse BCR 



Table 1. 


Sample information 








Sample 


Patient type 


Age, years 


Gender 


Time since CLL diagnosis, years 


CLL1 


CLL 


77 


Male 


7 


CLL 2 


CLL 


58 


Male 


2 


CLL 3 


CLL 


78 


Male 


1.5 


CLL 4 


CLL+HCC 


77 


Male 


2.5 


CLL 5 


CLL 


59 


Female 


1.25 


CLL 6 


CLL 


67 


Male 


2 


CLL7 


CLL 


69 


Male 


13 


CLL 8 


CLL 


64 


Male 


4.5 


CLL 9 


CLL 


77 


Male 


5.25 


CLL 10 


CLL 


81 


Male 


8 


CLL 11 


CLL 


81 


Male 


10 


Healthy 1 


Age matched control 1 


74 


Female 




Healthy 2 


Age matched control 2 


62 


Female 




Healthy 3 


Age matched control 3 


75 


Female 




Healthy 4 


Age matched control 4 


67 


Female 




Healthy 5 


Age matched control 5 


68 


Female 




Healthy 6 


Healthy 6 


55 


Male 




Healthy 7 


Healthy 7 


23 


Male 




Healthy 8 


Healthy 8 


23 


Male 




Healthy 9 


Healthy 9 


25 


Male 




Healthy 1 0 


Healthy 1 0 


24 


Female 




Healthy 1 1 


Healthy 1 1 


24 


Female 




Healthy 12 


Healthy 12 


24 


Female 




Healthy 1 3 


Healthy 1 3 


24 


Female 





(CLL) Chronic lymphocytic leukemia; (HCC) hepatocellular carcinoma. 



populations, whereas the LCL and CLL samples represent more re- 
stricted or clonal BCR populations. Sanger and MiSeq sequencing 
confirmed that the dominant clonal sequences from the CLL sam- 
ples were identical to that from 454 sequencing (excluding homo- 
polymeric indels) (Supplemental Fig. S2). 

Validation of sequencing to represent the B-cell populations 

To assess the reproducibility of the RT-PCR sequencing method to 
sample the BCR repertoire, we compared the number of over- 
lapping sequences from two types of technical repeats on a range 
of samples: (1) repeating the RT-PCR and resequencing (RT-PCR 
repeats) and (2) repeating the 454 sequencing from the same RT- 
PCR product (sequencing repeats). The percentage of the sequences 
shared (no more than 1-bp difference) between sequencing runs 
was calculated using all-against-all alignments. This showed over 
98% and 30% reproducibility for LCL and healthy PB samples, 
respectively (Fig. IC), probably due to the increased probability of 
resampling more abundant BCR types in LCLs. It is clear that the 
sequencing overlaps between RT-PCR repeats (Fig. IC, green bars) 
are not significantly different from those between sequencing re- 
peats (Fig. IC, purple bars) (P-value = 0.738 by paired t-test), sug- 
gesting that our RT-PCR amplification and sequencing depth is 
sufficient to be representative of the major clonal BCR population 
in the sample. 

Comparison between independent primer sets suggests limited 
primer bias 

To assess whether multiplex PCR methods cause significant PCR 
amplification bias, we determined the correlation between IGHV 
gene usages for samples independently amplified by the FRl and/or 
FR2 primer sets. The IGHV gene usage frequency distributions for 
both healthy individuals and LCLs resemble those seen by Arnaout 
et al. (201 1) (Supplemental Fig. S3A,B). /Gffy gene usage frequencies 
between FRl and FR2 amplified samples 
from eight LCL samples and four healthy 
individual samples are highly correlated 
(i^-value = 0.984, Supplemental Fig. S3C) 
suggesting that there is minimal primer 
amplification bias. Performing RT-PCR re- 
peats using the FRl primer set measures 
the stochastic effect of resampling the 
RNA populations, and again we found 
a strong correlation between IGHV gene 
usage frequencies between replicates (R- 
value = 0.972, Supplemental Fig. S3D). 
Importantly, for both of these compari- 
sons, we see a strong linear correlation 
between IGHV gene usage frequencies 
when percentages are >5% of the overall 
population, but BCR sequences at a fre- 
quency of 0%-5% show some effect of 
stochastic sampling, irrespective of primer 
set usage. Therefore, overall, we do not see 
significant primer amplification bias un- 
der the conditions used here. 



Limitations of V-D-] classification 

We classified the V-D-J genes for each read 
by sequence similarity to germ-line se- 
quences from the ImMunoGeneTics da- 
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A) 
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Bi) 

LCL sample 1 

58,117 sequences total 
99.6 % maximum cluster 




D) 

CLL sample 5 

26,086 sequences total 
48.0 % maximum cluster 



tabase (IMGT) (Lefranc et al. 2009; Sup- 
plemental Fig. S4). The majority of se- 
quences could be classified to their most 
closely related reference sequences for 
IGHV and IGHJ genes (average of 99.8% 
and 96.1%, respectively). Substantially 
fewer IGHD were identifiable (average of 
40.5%) due to the shorter sequence length 
and potential insertions and deletions 
within the joining regions between the 
V-D-J boundaries, which has been noted 
in previous studies (Weinstein et al. 2009). 
Incomplete V-D-J gene classification may 
be due to SHM masking the identity of 
the germ-line genes present in individuals 
and/or the existence of allelic variants of 
reference IGH (Boyd et al. 2010). We find 
no significant difference between the 
percentage of classified V, D, and J genes 
of our data set compared with that of 
Boyd et al. (2009) (Supplemental Fig. S4). 
To overcome limitations of IGHV-D-J gene 
classification, we propose that the use of 
complete V-D-J sequence information and 
mutational relationships would be a more 
informative and robust framework for 
BCR repertoire analysis and B-cell pop- 
ulation structure than simple V-D-J gene 
classification. 



BCR sequences naturally organize into 
networks based on sequence diversity 

For each sample, filtered and trimmed 454 
or MiSeq sequences were used directly to 
generate a sequence network (Fig. 2A). 
Each vertex represents a different se- 
quence, and the number of identical BCR 
sequences defines the vertex size. Edges 
are created between vertices that differ by 
one nucleotide. Clusters are groups of 
interconnected vertices. Differences in 
network architectures are clearly seen by 
comparing B-cell populations from healthy 

individuals and LCLs. In LCLs, the majority of 454 sequences fall 
within a small number of clusters as these samples are pre- 
dominantly comprised of a small number of large B-cell clone types 
(Fig. 2B,i). In contrast, healthy individuals have sparsely connected 
networks where most sequences are unique, thus yielding small 
vertices indicative of high BCR sequence diversity in the sampled 
repertoire (Fig. 2B,ii). From healthy individuals, 4.8%-32.2%of BCR 
sequences fall within clusters of three or more reads, with the largest 
cluster representing 16.7% (4023 reads) of the total population in 
healthy individual 10. Sequences within a cluster are most likely 
related to a single V-D-J BCR progenitor. Alignment of sequences 
within the clusters shows that the nucleotide differences are dis- 
tributed along the length of the 454 sequences (Supplemental Fig. 
S5A-C). In all of the healthy individual samples, mutations sig- 
nificantly occur within the complementary determining regions 
(CDRs), known to be hotspots for somatic hypermutation (Lin et al. 
1997) compared with the framework regions (FWRs) (P-value = 
0.000338, Supplemental Fig. S5D). As expected, the LCLs showed no 
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Figure 2. B-cell receptor repertoires fronn different samples. {A) Schennatic diagram showing the 
method by which the sequencing networks are generated: Each vertex represents a unique sequence, 
where the relative size of the vertex is proportional to the number of 454 sequencing reads that were 
identical to the vertex sequence. Edges are created between vertices that differ by one base (indel or 
substitution). The vertex colors correspond to the relative abundance of the corresponding sequences, 
where red, orange, and yellow indicates observation of a sequence in >90%, between 40%-90%, and 
<40% of the reads in the sample, respectively. (S) Comparison of BCR sequence networks between (i) 
atypical LCL sample and (ii) a typical healthy individual. (C) BCR sequence networks of CLL patients with 
(i) extensive clonal enlargement and (ii) limited clonal expansion. (D) BCR sequence networks of CLL 
patient 5 showing expansion of two dominant clusters. (£) Networks generated from sequencing data 
set from Boyd et al. (2009) of (i) healthy donor 1, (ii) patient 2 with follicular lymphoma (FL), and (iii) 
patient 3 with FL and small lymphocytic lymphoma (SLL). 



significant difference between the mutational proportions of the CDR 
and FWR regions. Mismatches are not found primarily at the V-D or 
D-J boundary, suggesting that cluster sequences are derived from the 
same B-cell precursor. 

The maximum CLL vertex sizes differ between samples 
(39.2%-99.5% of total sequences), suggesting that large but var- 
iable-sized subsets of B-cells express the predominant BCR se- 
quence(s), surrounded by BCR variants (including total process 
errors) of the dominant sequence. Of note, the extent of cluster- 
size diversity is different between CLL samples, with some dis- 
playing extensive clonal enlargement (Fig. 2C,i), whereas others 
have more limited clonal expansion (Fig. 2C,ii) and expansion of 
two dominant clusters (Fig. 2D). Similar networks were generated 
from Boyd et al.'s (2009) data set (Fig. 2E). Therefore, the magni- 
tude of connectivity of different samples varies between individual 
patients with CLL. In all cases, however, the CLL sequence networks 
are clearly distinct from the largely sparsely connected age-matched 
healthy individuals. 
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Population measures capture network and sample diversity 

We next aimed to quantify BCR population measures to allow 
comparison and interpretation of B-cell repertoire dynamics and 
biology. We investigated several parameters to describe different 
aspects of the B-cell populations. The Gini Index is an unevenness 
measure. When applied to the vertex size distribution for a given 
sample, these measures indicate the overall clonal nature of a sample, 
and when applied to the cluster-size distributions, these measures 
indicate the overall SHM of a sample. The maximum cluster size is the 
percentage of reads corresponding to the largest cluster and indicates 
the degree of clonal expansion of a sample. To assess the possibility of 
dual clonal expansions, we include a measure of the second maxi- 
mum cluster size as a percentage of reads in a sample. 



The LCL samples, due to the more restricted repertoires and 
highly connected clusters, yield high cluster and vertex Gini In- 
dices (averages of 0.94 and 0.80, range 0.91-0.97 and 0.62-0.91, 
respectively) (Fig. 3 A), suggesting a high unevenness of the size 
distributions. In contrast, B-cell networks of healthy individuals 
occupy a distinct region of the Gini Index vertex and cluster space 
(averages of 0.21 and 0.05, range 0.10-0.39 and 0.03-0.11, re- 
spectively). The CLL samples occupy a spatial range between 
healthy individuals and LCL B-cell population extremes with a low 
vertex (between 0.62 and 0.97), and cluster Gini Indices (between 
0.15 and 0.83), indicating B-cell clonal expansions. There is, 
however, considerable variation between the cluster Gini Indices, 
with CLL patients 1, 10, and 11 having low-cluster Gini Indices, 
indicative of a highly expanded dominant cluster or dominant 
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Figure 3. Measures differentiating between B-cell receptor populations. (A) Cluster Gini Index plotted against vertex Gini Index fori 3 healthy individual 
samples, 1 1 chronic lymphocytic leukemia (CLL), and eight human lymphoblastoid cell line (LCL) samples. Point (a) corresponds with healthy individual 
10. The red box and gray dashed box distinguish between the regions occupied between diverse and clonal populations, respectively. (B) The second 
maximum cluster sizes plotted against the maximum cluster sizes. The red, gray-dashed, and black solid boxes distinguish between the regions occupied 
between unexpanded populations, monoclonal expanded populations, and biclonally expanded populations, respectively. (C) B-cell receptor networks 
for the titration of a chronic lymphocytic leukemia clonal sample into healthy peripheral blood from the data set from Boyd et al. (2009), and (D) the 
corresponding number of reads corresponding to the leukemic clone (green) and the maximum cluster size of each dilution (gray). The solid horizontal 
line shows the mean maximum cluster size for healthy individuals from this data set (0.52% of total reads), and the dashed horizontal lines show the 
mean ±SD of maximum cluster size for healthy individuals for this data set. (f) Correlation between the Gini Index and the length of time since 
chronic lymphocytic leukemia (CLL) diagnosis for each patient in our data set, with corresponding /?^-value. 
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clones. Of note, one healthy individual (healthy individual 10) has 
a more developed network as defined by an increase in connec- 
tivity and vertex sizes, resulting in higher vertex and cluster Gini 
Indices (Fig. 3 A, point a). This was also verified by independent 
sequencing using the FR2 primer set (Supplemental Fig. S6). Fur- 
ther, the highest expressed BCR sequence for healthy individual 10 
has 90.6% sequence identity with the closest germ-line IGHVgene 
(16 mismatches in 243 bp of alignment) suggesting that this B-cell 
clone has undergone SHM. 

We generated networks from the sequences derived from 
Boyd et al. (2009) to validate these population measures on in- 
dependent BCR sequence data. We show that the clonal pop- 
ulations of the patients with CLL, small lymphocytic lymphoma 
(SLL), and/or follicular lymphoma (FL) are distinct from the diverse 
populations of healthy individuals (Supplemental Fig. S7A), oc- 
cupying equivalent regions of the cluster and vertex Gini Index 
graphs to samples within this study. Therefore, we conclude that 
the Gini Index population measure robustly separates distinct 
B-cell populations into different regions based on the clonal nature 
of the sample. 

We then determined whether we could separate monoclonal 
expansions, biclonal expansions, and diverse B-cell populations 
using the maximum cluster sizes and second maximum cluster 
sizes (Fig. 3B). We show that the CLL and LCL samples have max- 
imum cluster sizes >30% of the total reads compared with maxi- 
mum cluster sizes of healthy individual samples of <20%. How- 
ever, the LCLs and CLLs collectively occupy two distinct regions 
in this space. One group exhibits a single dominant clonal se- 
quence, where the remainder of the clusters are <5% of the total 
reads (Fig. 3B; surrounded by the dashed line). The second group 
of samples has two dominant clusters above 40% and 20% of 
the total reads, respectively. The CLL patient 5 repertoire com- 
prises two dominant clonal groups, each utilizing different V-D-J 
genes ([IGHV3-66*03/IGHD6-19*01/IGHJ3*02] and [IGHV6-1*01/ 
IGHD3-3*01/IGHJ4*02], respectively), where the two clones are 
unlinked and represented by completely different BCR sequences 
(Fig. 2D; Supplemental Fig. S8). Limited polyclonal expansions 
were also observed in 5/8 of the LCL samples, reflecting that EBV 
transformation of peripheral B-cells frequently results in poly- 
clonal LCLs. Using the data set from Boyd et al. (2009), we show 
the same phenomenon of polyclonal expansions in a subset of 
samples (patients with CLL/SLL and FL/SLL, Fig. 2E,iii) where the 
maximum cluster sizes are >35% and second maximum cluster 
sizes are >19% of the total reads (Supplemental Fig. S7B). There- 
fore, the polyclonal status of the tumor samples can be de- 
termined using B-cell network reconstruction and analysis. 

An important requirement of this approach is that the net- 
work diversity measures must not be highly dependent on the 
depth of sequencing (scale invariant) and volume of PB sample. If 
a given diversity measure is scale invariant for B-cell networks, 
then the network diversity measure should be the same regardless 
of the depth of sampled sequences, i.e., a subset of 454 sequences 
should yield the same network diversity measure as the full set of 
sequences. We tested all of the proposed population measures as 
a function of sequencing depth by randomly sampling different 
proportions of the sequence data for each sample, followed by 
calculation of the corresponding network parameters for both the 
vertex and cluster-size distributions for the LCL, CLL, and healthy 
samples. All of the proposed measures showed little variation at 
different sample sizes, even when subsampling was as low as 20% 
of the original data size (Supplemental Fig. S9). Below 20%, small 
deviations in the Gini Index measures are seen. As these network 



measures had minimal standard deviation over all sub-sampling 
ranges, they are therefore robust parameters for intersample 
comparison. 

Minimal effect of sequencing errors on network properties 

We determined whether clusters were likely to be due to the pro- 
cess of somatic hypermutation or sensitive to or generated through 
sequencing error of unique amplified BCR sequences. For a given 
BCR sequenced multiple times, such as when multiple B-cells ex- 
press identical BCRs, we estimated the expected number of vertices 
comprising a cluster that could be due to sequencing error given 
our experimentally derived PGR and sequencing error-rates. We 
find that all of the samples have cluster sizes greater than that 
expected due to error alone, even at twice the measured error-rate 
(Supplemental Fig. SIO). Therefore, the connectivity patterns of 
networks predominantly reveal differences in clonal expansions of 
B-cell populations rather than total sequencing errors. We propose 
that clusters identified in BCR networks are therefore derived from 
B-cells that share a common pro-B-cell progenitor with rearranged 
V-D-J that have subsequently expanded and diversified. 

Directly comparing the Gini Index measures of V-D-J se- 
quences from samples amplified independently by distinct primer 
sets (FRl or FR2 primer sets) showed a strong positive linear cor- 
relation between the two primer sets, with i^-values of 0.999 and 
0.996, respectively, for the vertex and cluster size diversities (Sup- 
plemental Fig. SUA). This supports a lack of PGR or sampling bias 
or an effect of sequencing errors for independent RT-PCRs with FRl 
compared with FR2 primer sets. Further, we find that networks 
generated to include edge lengths of up to five base changes faith- 
fully retain the network architecture for both the LCL and healthy 
individual samples (Supplemental Fig. SI IB). 

BCR repertoire network properties relate to CLL development 

To assess the sensitivity of this analysis method, we use the titra- 
tion experiment from Boyd et al. (2009), in which serial 10-fold 
dilutions of a known clonal CLL PB sample into normal peripheral 
blood was performed. We find 90.9% of all reads in the undiluted 
sample fall within the leukemic cluster (Fig. 3C,D). Using our 
methods, we can detect the leukemic clonal sequences at dilutions 
as low as 1:100,000. When the leukemic cluster sequences are 
unknown, detection of expanded clones relies on detecting the 
maximum cluster size that is significantly different from that of 
healthy individuals. We see significant increases in maximum 
cluster size above that of the healthy individual in CLL dilutions of 
1:100 or less. Therefore, deep sequencing of BCR repertoires po- 
tentially allows the detection of a clonal lymphoid population in 
a background of polyclonal cells without prior knowledge of the 
leukemic sequence types by comparing to healthy control BCR 
clonality (Sayala et al. 2007). 

We therefore sought to understand the relationship between 
the BCR population measures and the CLL clinical information for 
each patient. Interestingly, there was a strong correlation between 
the length of time since CLL diagnosis and the vertex Gini Index 
(Fig. 3E). This suggests that longer disease times lead to larger ver- 
tices representing larger tumor clonal populations, in agreement 
with previous studies (Kelly et al. 2002; Hayes et al. 2010). 

By BCR deep-sequencing and network analysis, we hypothe- 
size that we can follow the dynamics of dominant clonal pop- 
ulations at multiple time points. To test this, we sampled a stage B 
CLL patient (patient A) during the course of therapy. A pretreatment 
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sample was taken immediately before entering the second cycle 
of Chlorambucil treatment, and the second sample was taken 28 d 
later. The BCRs were sequenced on the MiSeq platform, yielding 
40,414 and 36,197 high-quality reads, respectively The most 
abundant BCR sequences in the two samples were identical. Net- 
work construction shows a single dominant cluster at both time 
points, which decreases from 86% to 53% of total reads, reflecting 
the reduction in WBC (Fig. 4A). This corresponds to the vertex Gini 
Index reducing from 0.892 to 0.713 and the cluster Gini Index re- 
ducing from 0.244 to 0.138. A composite network was generated of 
all of the sequences from the dominant clusters in both time-points, 
where the vertex sizes correspond to frequencies of each BCR at 
either the pre- or post-treatment samples (Fig. 4B). The proportions 
of each unique BCR sequence between the pre- and post-treatment 
samples give a strong linear relationship (i^-value = 0.999995) (Fig. 
4C), where the post-treatment proportional frequencies are 62% of 
those in the pretreatment sample, indicating that all BCR clones 
within the leukemic cluster are equally affected by this treatment. 

To understand the evolution of the leukemic cluster over 
time, a maximum parsimony tree was fitted (Schliep 2011) en- 
compassing sequences that were observed at least six times for the 
two samples (Fig. 4D). Bootstrapping was performed to evaluate 



the reproducibility of the trees, showing strong tree support (>95% 
certainty for all branches). A total of 122/231 of the BCR sequences 
are unique to the pretreatment sample (e.g.. Fig. 4, Clone 1) com- 
pared with 13/231 unique to the post-treatment sample (e.g.. Fig. 4, 
Clone 2). However, the sequences that are primarily observed in the 
post-treatment sample show divergence from the dominant leuke- 
mic clone, suggesting leukemic cluster BCR evolution during treat- 
ment (e.g.. Clone 2). A similar analysis was performed on an in- 
dependent data set from Boyd et al. (2009) for a patient with chronic 
lymphocytic leukemia and small lymphocytic lymphoma samples 
separated by 3 mo (Supplemental Fig. SI 2) showing distinct clonal 
diversification patterns. 

Discussion 

Deep sequencing of B-cell and T-cell repertoires offers the potential 
for quantitative understanding of the adaptive immune system in 
health and disease. Here, we use deep sequencing of B-cell receptor 
V-D-J population frequencies and novel analyses of BCR reper- 
toires at the level of clonal populations. The observation of fre- 
quent multiple identical BCR sequences in tumors and much lower 
frequency in identical BCR sequences in PB from healthy in- 
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Figure 4. B-cell leukemic clonal evolution. (A)The B-cell sequence networks for patient A with chronic lymphocytic leukemia for samples (i) prior to and 
(ii) after second cycle of Chlorambucil treatment, separated by 1 mo with corresponding white blood cell counts. (B) All sequences from the dominant 
clusters from both temporal samples were used to generate a composite network, and the differential frequencies at each time point are indicated by the 
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dividuals suggests that we rarely sequence multiple identical RNA 
molecules from a single B-cell. Therefore, clusters of related se- 
quences are likely to represent BCRs from clonal expansions of 
evolutionarily related B-cells, whereas naive B-cell populations 
form singletons in sparsely connected networks. The effects of RT- 
PCR or sequencing error and amplification bias on our analysis, 
often of concern for deep sequencing, are minimal. We show 
a strong linear correlation between the network parameters of 
samples that have been RT-PCR amplified using different primer 
sets to distinct regions of the IGH variable RNA transcript, sug- 
gesting that the PGR methods here have limited primer or am- 
plification bias. We confirm the dominant clonal sequences for 
the CLL patients by Sanger sequencing, and show that in all cases 
the samples have cluster sizes notably greater than that expected 
due to the measured total process error-rate. Therefore, these 
observed V-D-J clusters are likely to have undergone mutational 
processes greater than process errors. 

We define for the first time B-cell V-D-J sequence population 
measures that describe the clonality of the sequences and quantify 
both the effect of B-cell sequence diversification (cluster size) and 
clonal proliferation (vertex size) using the Gini Index as an un- 
evenness measure. The maximum and second maximum cluster 
size is used to assess dual clonal expansions. If the B-cell network 
from limited sequencing is a random sample of the entire circu- 
lating peripheral blood BGR repertoire, then a scale invariant di- 
versity measure should also capture the predominant structure of 
the unsampled network. We show that network structures com- 
bined with these population measures discriminate between B-cell 
repertoires of different clonalities in health and disease. These 
measures are robust to variations in sequencing and sampling 
depth and different filtering strategies, and are applicable to in- 
dependently produced data sets (Boyd et al. 2009). Using differ- 
ent primer sets, sequencing depths, and sequencing technologies, 
the samples still cluster according to the clonal nature of the 
samples, occupying the equivalent distinct regions of Gini Index 
and maximum/second maximum graphs. Therefore, this analytical 
strategy is applicable to any BGR deep-sequencing technology. 

We observed variation between the BGR repertoires in healthy 
individuals and in GLL. One healthy individual showed a more 
developed network, defined by an increase in connectivity, with 
corresponding higher Gini Index values and larger maximum 
cluster sizes compared with the other healthy individuals. This 
clone was not germ line in sequence and could be a result of an- 
tigen-specific memory B-cell expansion or an undiagnosed ma- 
lignant transformation. Variation in network structures between 
individual healthy zebrafish BGR repertoires was also observed in 
the study by Ben-Hamo and Efroni (2011), where higher connectiv- 
ity suggested an immune response within the individual. Similarly 
in GLL, assessing the maximum and second maximum cluster sizes, 
we identify patients with more than one BGR clonal expansion, 
where the two dominant clones have different V-D-J gene usages. 
This may be due to either the expansion of two distinct malignant 
B-cell transformations or separate antigen-stimulated B-cell clonal 
expansion unrelated to GLL. These methods used in time-series may 
allow the distinction between antigen-driven positive selection in 
GDRs compared with malignant-driven expansion (Supplemental 
Fig. S5). Multiple separate clonal B-cell populations have been ob- 
served in previously published data in a subset of patients identified 
by different V and J chain usages (Hsi et al. 2000; Boyd et al. 2009), 
but the clinical significance of these findings are not known. 

Time-dependent evolution of BGR networks may, however, pro- 
vide a powerful means of assessing B-cell tumor evolution and re- 



sponse to therapy as well as the dynamics of a healthy B-cell repertoire. 
The GLL vertex Gini Index is correlated with the time an individual has 
been living with GLL (Fig. 3E). This coupled with the observation of 
in vivo evolution of BGR clones in GLL (Fig. 4) suggests that BGR 
sequencing in GLL may provide an additional prognostic value for 
the disease. Divergent evolution from a common leukemic ancestor 
has previously been observed in GLL, possibly through the accumu- 
lation of driver mutations with selective advantages in growth over 
other subclones (Gampbell et al. 2008). Hypermutations within the 
IGH locus may also play a driver role in clonal expansions (Ghia and 
Galigaris-Gappio 2006). Therefore, BGR sequencing and subsequent 
network and evolutionary analysis may play an important role in 
identifying population changes. However, an evolutionary model 
for BGR diversity in health and disease, similar to the models used in 
infectious disease phylogenetics is needed to fully explore these 
possibilities. Nevertheless, for the first time we show the short-term 
effect of therapy on the B-cell repertoire in GLL and demonstrate 
how networks lend themselves to phylogenetic approaches. These 
methods are sensitive and informative for characterizing of B-cell 
populations in health and B-cell malignancies. 

Methods 

Samples 

Peripheral blood mononuclear cells (PBMGs) were isolated from 10 
mL of whole blood from healthy volunteers and GLL patients using 
FicoU gradients (GE Healthcare). Total RNA was isolated using TRIzol 
and purified using RNeasy Mini Kit (Qiagen), including on-column 
DNase digestion according to the manufacturer's instructions. Total 
RNA was also isolated from 1 X 10^ cells from human lympho- 
blastoid cell lines (LGLs) from the HapMap project (The International 
HapMap Gonsortium 2007). Research was approved by relevant in- 
stitutional review boards and ethics committees (07/MRE05/44). 

RT-PCR 

RT-PGR reagents were purchased from Invitrogen. The FRl and FR2 
primer sets used (supplied by Sigma Aldrich) are described by van 
Dongen et al. (2003) and in Supplemental Table SI. Reverse transcrip- 
tion was performed using 500 ng of total RNA mixed with 1 |jlL of JH 
reverse primer (1 |jlM), 1 |jlL of dNTPs (0.25 mM), and RNase free water 
added to make a total volume of 1 1 |jlL. This was incubated for 5 min at 
65°G and 4 |jlL of First strand buffer, 1 |jlL of DTT (0.1 M), 1 |jlL of 
RNaseOUT Recombinant Ribonuclease Inhibitor, and 1 |jlL of Super- 
Script III reverse transcriptase (200 units/ixL) were added. RT was per- 
formed at 50°G for 60 min before heat-inactivation at 70°G for 15 min. 
PGR amplification of cDNA (5 fxL of the RT product) was performed 
with the JH reverse primer and the FRl or FR2 forward primer set pools 
(0.25 |jlM each) using 0.5 |jlL of Phusion High-Fidelity DNA Polymerase 
(Finnzymes), 1 |jlL of dNTPs (0.25 mM), 1 |jlL of DTT (0.25 mM) per 50 
|jlL of reaction. The following PGR program was used: 3 min at 94°G, 35 
cycles of 30 sec at 94°G, 30 sec at 60°G, and 1 min at 72°G, with a final 
extension cycle of 7 min at 72°G on an MJ Thermocycler. 

Sequencing methods 

454 libraries were prepared using standard Roche 454 Rapid Prep 
protocols incorporating 10-base multiplex identifier (MID) tags 
and sequenced using a 454 GS FLX Titanium (Roche) or by 250-bp 
paired-ended MiSeq (Illumina). Raw 454 or MiSeq reads were fil- 
tered for base quality (median >32) using the QUASR program 
(http://sourceforge.net/projects/quasr/) (Watson et al. 2013). MiSeq 
forward and reverse reads were merged together if they contained an 
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identical overlapping region of >65 bp, or otherwise discarded. Non- 
immunoglobulin sequences were removed and only reads with sig- 
nificant similarity to reference IGHV genes from the IMGT database 
(Lefranc et al. 2009) using BLAST (Altschul et al. 1990) were retained 
(1 X 10~^°£-value threshold). Primer sequences were trimmed from 
the reads, and sequences retained for analysis only if both primer 
sequences were identified and if sequence lengths were >255 bp or 
195 bp for FRl and FR2 primed samples, respectively, for 454, or 
both forward and reverse reads >1 10 bp for MiSeq. FRl -primed PGR 
samples from CLL patients were also Sanger sequenced. 

Per-base error quantification 

The same PGR protocol and read quality filtering was used to am- 
plify beta actin, beta hemoglobin, and GAPDH genes from two 
healthy individuals (amplicon sizes of 150 bp and 340 bp, re- 
spectively). The sequence representing the majority of the reads 
for each sample was classified as the "true'' gene sequence for that 
individual to account for individual allelic variation. Any differ- 
ences between this sequence and the reads were considered to be 
a PGR and/or sequencing error and classified as homopolymeric 
indels (occurring in a region of two or more consecutive identical 
bases), nonhomopolymeric indels, or mismatches. 

Reference-based V-D-] assignment 

BLAST (Altschul et al. 1990) was used to align the 454 sequences 
against known BGR sequences from the ImMunoGeneTics (IMGT) 
database (Lefranc et al. 2009). Due to the difference in length of the 
different gene families, different BLAST E-value thresholds were 
used for the IGHV, IGHD, and IGHJ genes (10"^^, 10"^ and 10"^°, 
respectively). 

Network assembly and analysis 

The network generation algorithm is summarized in Figure 2A and 
Supplemental Figure SI. Briefly each vertex represents a unique 
sequence, where the relative size of the vertex is proportional to 
the number of sequence reads identical to the vertex sequence. 
Edges were calculated between vertices that differed by single nu- 
cleotide non-indel differences. The network analyses were per- 
formed using igraph implemented in R (http://igraph.sourceforge. 
net/index.html). The distribution of mismatches within a single 
network cluster were determined by aligning the sequence repre- 
senting the largest vertex with the sequences to which it is con- 
nected, and the positions of mismatches were determined along the 
sequences. Two-sided t-tests were performed in R. 

Diversity measure calculations 

The Gini index was calculated by ordering the cluster sizes from the 
largest to smallest and creating a cumulative frequency distribution, 
where R= {ri, rz, . . . , r„} r/ is the cumulative size of all of the largest 
clusters until the i*^ largest cluster and normalized such that r„ = 1 . The 
Gini index is 

Gini index(g) = 2^ — ^ , 

where N is the number of clusters (Morrow 1977). 

Estimation of cluster sizes due to sequencing error 

The Poisson distribution can estimate the expected number of reads 
containing / errors from the (central) vertex of size n reads, given an 



estimated error rate. The expected number of sequences with / errors 
is n.pi, where 



Pi=P{X = i) 



i\ 



and A is the expected number of mutations per read. A cluster is 
defined as a set of interconnected vertices in which edges are 
generated between vertices that differ by a single base. A vertex v is 
only included in a cluster when the minimum distance from v to 
any of the sequences in the cluster containing the central vertex is 
one. Thus, all of the sequencing errors at i=l generate vertices that 
have edges connecting to the central vertex. At z >i, a vertex with 
a set of mutations will be connected to the cluster only if there 
exists a vertex in the cluster with a set of mutations My such that 



-.\{xeM^\xiMy}\ = l 



(i.e., there is only one mutation in Mx that is not in My). Therefore, 
the probability of vertices due to / sequencing errors is estimated by 
drawing S[n,i] samples from a multinomial distribution, for 
which the probability of the possible vertices that could connect 
to the cluster is given by 



■Pv 



where / is the length of the sequence and E[n,f\ is the estimated 
number of vertices that are in the cluster that are at a distance of / 
from the central node. Here, we draw 1000 independent samples 
from the multinomial distribution to estimate the average number 
of vertices at distances / from the central vertex and, therefore, the 
cluster size due to sequencing error can be estimated by summing 
over the expected number of vertices at all /, 1 < / < oo. 

Temporal evolution of dominant clones 

Sequences from the dominant clusters that were observed at least 
six times for the two samples underwent a multiple alignment 
using the GlustalW2 algorithm (www.ebi.ac.uk/Tools/clustalw2/ 
index.html) with default parameters. A phylogenetic tree was fitted 
using the unrooted parsimony methods implemented in R (http:// 
cran.r-project.org/web/packages/phangorn/). Model tests were per- 
formed on different substitution models, for which the JG+G+I 
substitution model was found to be optimum and thus used here. A 
total of 1000 bootstrap samples of individual nucleotides in the 
multiple alignment was used to assess the reproducibility of the 
phylogenetic trees. The proportional difference in expression, diff(i), 
of a given sequence / between month 0 and month 3 was calculated 
by the difference in expression between the two time points and 
divided by the sum of the expression of the sequence over both 
times: 



Difni) : 



Ei(t = 3)-Ei(t = 0) 
'E,-(f = 3)+f,-(t = 0)' 



where £/(f = 0) and £i(f = 3) is the expression of sequence / at 0 mo 
and 3 mo, respectively. 

Data access 

The IGH sequences discussed can be found under accession num- 
ber ERP002120 in the European Nucleotide Archive (ENA; http:// 
www.ebi.ac.uk/ena/). 
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